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Abstract 

This paper proposes a new robust regression interpretation of sparse penalties such as the 
elastic net and the group-lasso. Beyond providing a new viewpoint on these penalization 
schemes, our approach results in a unified optimization strategy. Our evaluation experi- 
ments demonstrate that this strategy, implemented on the elastic net, is computationally 
extremely efficient for small to medium size problems. Our accompanying software solves 
problems at machine precision in the time required to get a rough estimate with competing 
state-of-the-art algorithms. 

Keywords: sparse regression, robust optimization, lasso, elastic net, group-lasso 



1. Introduction 

Robust optimization aims at solving problems where data is uncertain. This viewpoint 
has recently been investigated in machine learning, providing novel interpretations of well- 



known methods, thereby opening new perspectives for their analysis (Ben-Tal et al. , 2009). 



Support vector machines can be interpreted in the robust optimization framework under 
the corrupted location model (Caramanis et al. , 2012), which assumes that uncertainty 



affects input variables. More closely related to our present concerns. Lasso has been shown 
to be the solution of a robust least squares problem where input variables are assumed to 



be corrupted by a feature- wise disturbance (Xu et al. 2010) 



In this paper, we introduce a new robust regression interpretation of sparse penalties, 
where both the design matrix and the output variable are assumed to be corrupted. We 
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suppose that responses are affected by an adversarial noise, and according to the properties 
of this noise, several sparsity-inducing penalties can be recovered. We use this formalism 
to derive a general-purpose algorithm for solving classical sparse regression problems. 

Robust optimization leads to formalizing minimax problems. In our approach, the inner 
problem is a simple unconstrained quadratic problem, and we propose an optimization 
strategy based on the iterative resolution of plain quadratic problems. The approach is 
highly competitive compared to state-of-the-art optimization strategies for sparse regression 
such as coordinate descent (Fu 1998) or proximal methods (Beck and Teboulle, 2009). The 



general-purpose algorithm introduced in this paper relies on second order techniques to 
solve a series of quadratic approximations to the sparse regression problem. Our solver is 
thus applicable to small and medium size problems (that is, problems from several hundreds 
to a few thousands of active variables on a current computer) , where it is very accurate and 
faster than the first order techniques mentionned above. 

Section [2] introduces our general robust regression formalization, which allows numerous 
variants that follow from the definition of the uncertainty set on the adversarial noise, 
thereby leading to different sparse regression problems. Section[3]fully details the derivations 
for the Lasso and the group-Lasso (using the £00,1 mixed-norm) problems, applied together 
with an £2 ridge penalty (leading to what is known as the elastic net for the Lasso). A 
description of the general-purpose active set algorithm derived from this formalism is given 
in Section [4j which also introduces a new bound on the optimality gap stemming from 
the new formalization and the resolution scheme. Finally, Section [5] demonstrates that our 
solver is highly efficient compared to existing algorithms and popular implementations. 



2. Robust Optimization Framework 

As a simple motivating example, we consider a model where the response variable Y is 
assumed to be linearly related to the p predictor variables X = {Xi, . . . , Xp): 



Y = Xf3* + e 



(1) 



where is the vector of unknown parameters, assumed to be sparse, and e is a perturbation 
variable. We wish to estimate the regression coefficients /3* based on training data consisting 
of an n xp design matrix X and a response vector y G M". Here, we assume some deviations 
from the standard linear model, by considering that several sources of uncertainty affect 
the training data. 

Assumption 1 The design matrix X has been corrupted by an unobserved additive pertur- 
bation Ax G T^x., such that 

where \\-\\p is the Frobenius norm. The "true" design matrix corresponding to the X 
variable of Equation ([T]) is thus X — Ax. In addition, the perturbation e affects the linear 



1. This classic uncertainty set of the robust regression literature (Caramanis et al. 20121 may appear in 
stronger forms (2-induced norm for 'Chandrasekaran et al. 19981, (spectral radius of Ax for El Ghaoui 
and Lebret 1997 1. Though defining different sets, these options lead here to the same worst-case solution. 
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relationship between the observed X and y via the following decomposition: 



e = X7 + e 



(2) 



where 7 does not follow the sparsity pattern of (3* . The X7 component of e "adversarially" 
affects the linear relationship between X and y, while e represents a "neutral" perturbation, 
which is assumed to belong to the following uncertainty set: 



where \\-\\2 is the Euclidean norm (more generally, \\-\\p will denote the £p-norm). These 
assumptions entail the following relationship between X and y: 



That is, the observed responses are formed by summing the contributions of the unobserved 
clean inputs, the adversarial noise that maps the observed inputs to the responses, and the 
neutral noise. 

We will examine several assumptions about the adverse regression coefficients 7 that will 
be discussed in details in Section^ They state that 7 G V-y, where V-y is a convex neigh- 
borhood of in W, and also that the global uncertainty set on (Ax, £,7) is the Cartesian 
product Px X X . 

Note that we only consider uncertainty sets with finite support, which are amenable 
to a worst-case analysis. These sets are also relevant for unbounded deviations, in which 
case they typically represent the confidence regions associated to a prescribed confidence 
level. In particular, the neutral noise model ([S]) can be derived as a confidence interval for 
the centered homoscedastic Gaussian noise that is usually assumed in the linear regression 
model. 

Given the observed training data (X,y), robust estimation considers the problem of 
minimizing the worst-case scenario, so as to ensure a valid estimation for any perturbation 
belonging to the uncertainty set. For the time being, even without specifying the form of 
the uncertainty set V-^, one can derive a simple equivalent form of the robust regression 
problem. 

Proposition 1 The robust regression problem 



where Px, ^7 andV^ are defined as in Assumption^ is equivalent to the robust regularized 
regression problem: 




(3) 



y = (X - Ax)/3* + X7 + e . 





min max ||X/3 — ylU + f?v 11/3 — 7 



(5) 



See proof in Appendix A.l 
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In practice, Problem ([s]) will be solved by considering an equivalent form amenable to 
a simpler resolution in /3 for any 7, that is: 

min max I |X/3 — y| In + A 1 1/3 — 7I I9 , (6) 

where there is a one-to-one mapping between r/x and A. We now proceed to the definition 
of the uncertainty set V-y. Two options are proposed, each one entailing an equivalence 
with a well-known sparse regression method. 

3. Assumptions on the Spurious Regression Coefficients 

Though we envision several other applications of our framework, we present here two sim- 
ple examples following the same pattern: assuming a given regularity on the regression 
coefficients /3*, we consider the adversarial dual assumption on the spurious coefficients 7. 
When the initial regularity conditions on (3* are expressed by ii or too norms, this process 
results in uncertainty sets P-y which are convex polytopes that are easy to manage when 
solving Problem ([6]), since they can be defined as the convex hulls of a finite number possible 
perturbations. 

The two sparsity-inducing penalizers presented below have a grouping effect. The elastic 
net implements this grouping without predefining the group structure: strongly correlated 



predictors tend to be in or out of the model together (Zou and Hastie, 2005). The ^00,1 
group-Lasso that is presented subsequently is based on a prescribed group structure and 
favors regression coefficients with identical magnitude within activated groups. 

3.1 Elastic Net 

As an introductory example, let us consider the regularity assumption stating that the 
^i-norm of /3* should be small: 

The dual assumption is that the ^00-norm of 7 should be controlled, say: 



-T-)Lasso 
7 



7 G : sup 7T/3 < 1 

{7GMP:||7|L<??^} 
conv{ {-Tl^^Tl^Y ] , 



where rj^y = l/rjp and conv denotes convex hull, so that Problem (|6]) reads: 

n max 



min max ^ | ||X/3 - y||2 + A ||/3 - 7II2 | 



4^min||X/3-y||^ + 2A?7^||/3||i + A||/3||2 , (7) 



which is recognized as an elastic-net problem. When rj^ is null, we recover ridge regression, 
and when the magnitude of r/^ grows, the problem approaches a Lasso problem. A 2D 
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Figure 1: Admissible sets (represented by colored patches) for the elastic- net. Here, these 
sets are defined by the intersection of the Euclidean balls whose centers are rep- 
resented by crosses and boundaries are displayed in black. 



pictorial illustration of this evolution is given in Figure [T| where the shape of the admissibe 
set T>~^ is the convex hull of the points located at (±^77, =1=^/7)^ identified by the cross markers, 
thus defining the admissible set (3 by the intersection of Euclidean balls centered at these 
points. 

3.2 Group-Lasso 



We consider here the ^00,1 variant of the group-Lasso, which was first proposed by Turlach 



et al. (2005) to perform variable selection in the multiple response setup, which is first 
detailed before introducing the general situation. Following the previous example, we now 
consider the regularity assumption stating that the ^oo-norm of /3* should be small: 

The dual assumption is that the £i-norm of 7 should be controlled: 

pMax ^ J ^ g . g^jp < 1 

I /3eW^J"= 
= {7eMP: II7II1 <r?7} 

= conv {ri^el, . . . , 7776^, -rj^el, . . . , -r]^el] , 

where 7^7 = 1//?^ and is the jth element of the canonical basis of R^, that is ejji = 1 if 
j = j' and Cjji = otherwise. Then, Problem ^ becomes: 

min max \ ||X/3 — yllo + A 11/3 — 7II9 \ 
/3ei;p 7ex'^ax 1 1 1 ^ ■^112^ "12/ 

4^ min ||X/3-y||2 + 2Ar/7||/3||^ + A||/3||2 , 
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Now, consider the more general situation where a group structure is defined on the set 
of variables by setting a partition of the index set I = {1, . . . ,p}, that is, 



K 



T=[jQk , with gkngi = i/} for k^£ . 



k=l 



Let pk denote the cardinality of group k, and (3g^ G M^* be the vector {(3j)j^g^. We now 
consider the regularity assumption stating that the ^00,1 mixed-norm of /3* (that is, its 
groupwise £oo-norm) should be small: 



The dual assumption is that the groupwise £i-norm of 7 should be controlled: 



P7 



7 € MP : sup 7T/3 < 1 

■13* 



|7GMP:X;i|7gJ|i<^7| 



= conv{ {7?^ef , . . . , r/^ePJ , -7?^ef , . . . , -r]^eP\ } x . . . 

X |?7^eP^' , . . . , Tj^G^'^ , "'^761^ ) • • • ) ~'n'Y^p^ } } ) 

so that Problem ^ becomes: 

^{||X/3-y||^ + A||/3-7ll2} 



mm max 



K 



k=l 

Notice that the limiting cases of this penalty are two classical problems: ridge regression 
and the £00,1 group-Lasso. 

4. Algorithm 

The unified derivation for the problems presented in Section [3] suggests a unified processing 
based on the iterative resolution of quadratic problems. Our algorithm is summarized in 



this section. We then describe an alternative to Fenchel duality (used for example by Bach 



et al. 2012) to assess convergence, by computing an upper-bound for the gap between the 



current solution and the optimal one. 



4.1 Active Set Approach 



The efficient approaches developed for sparse regression take advantage of the sparsity 
of the solution by solving a series of small linear systems, whose sizes are incrementally 
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Figure 2: Admissible sets (represented by colored patches) for the ^00,1 group-Lasso. Here, 
these sets are defined by the intersection of the Euclidean balls whose centers are 
represented by crosses and boundaries are displayed in black. 



increased/decreased. Here, as for the Lasso (Osborne et al. , 2000 Efron et al. , 2004), this 



process boils down to an iterative optimization scheme involving the resolution of quadratic 
problems. 

The algorithm starts from a sparse initial guess, say (3 = 0, and iterates the three 
following steps: 

1. the first step solves Problem ^ with respect to /34, the subset of "active" variables, 
currently identified as being non-zero. This penalized least squares problem is defined 
from X._4, which is the submatrix of X comprising all rows and the columns indexed 
by A and 7^4, which is set to its current most adversarial value. 

2. the second step updates if necessary (and possibly 7^4), so that 74 is indeed (one 
of) the most adversarial value of the current /34. This is easily checked with the 
problems given in Section [sj where 2?-y is a convex polytope whose vertices (that is, 
extreme 7-values) are associated with a cone of coherent /3-value. 

3. the last step updates the active set A. It relies on the "worst-case gradient" with 
respect to (3, where 7 is chosen so as to minimize infinitesimal improvements of the 
current solution. Again picking the right 7 is easy for the problems given in Section |3| 
Once this is done, we first check whether some variables should quit the active set, 
and if this is not the case, we assess the completeness of A, by checking the optimality 
conditions with respect to inactive variables. We add the variable, or the group 
of variables that most violates the worst-case optimality conditions. When no such 
violation exists, the current solution is optimal, since, at this stage, the problem is 
solved exactly within the active set A. 

Algorithm [T] provides a more comprehensive technical description. Note that the structure 
is essentially identical to the one proposed by Osborne et al. (2000) or Efron et al. (2004) for 



the Lasso. The viewpoint is however radically different, as the global non-smooth problem 



2. When several 7^ are equally unfavorable to /3^, we use gradient information to pick the worst one among 
those when moves along the steepest descent direction. 
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Algorithm 1: Worst-Case Quadratic Penalty Active Set Algorithm 

Init. /3 ^ /3° 

Determine the active set: A ■'^ {j : P > 0} 

Pick a worst admissible 7, that is, 7 = arg max — g||^ 

step 1 Update active variables assuming that A and 7^4 are optimal 

^ Pa 

Pa ^ {^:a^.a + Al|^|)-' (x> + A7^) 

step 2 Verify coherence of 7_4 with the updated 

// if 7_4 is not worst-case 

*/ 



if \\Pa-^a\\2 < max II/34 -g4||2 then 



/* Backtrack towards the last 7_4-coherent solution: 

PA^Pt + PiPA-PT) 

7_4 is worst-case for /34, and there is another worst-case value 7_4 
/* Check whether progress can be made with 74 

Pa ^ i^^A^-A + AI|^|)"' (XT^y + A7^) 



*/ 



if 



Pa-^a 



max 



Pa-sa 



then 



L (Pa^a) ^ (Pa^a) 



/* The current 7^4 is coherent with (3^ 

step 3 Update active set A 



9j 



mm 



x}(X.^/3^-y) + A(/3,--7,: 



// if 7^ is worst-case. . . 
// iPAnA) is better than (/34,7^) 

*/ 

l,...,p // worst-case gradient 



// Downgrade j 
*/ 



if 3 j £ A : f3j = and gj = then 
A^A\{j}; 
/* Go to Step 1 
else 

if maxjg_4c Qj ^ then 

/* Identify the greatest violation of optimality conditions: */ 

j* -ir- arg max gj , ^ ^ ^ U {j*} ; II Upgrade j* 



/* Go to Step 1 



else 



1^ /* Stop and return /3, which is optimal 



is dealt with subdifferentials by Osborne et al. (2000), whereas we rely on the maximum of 



smooth functions, suggesting the assessment of convergence detailed below. 
4.2 Monitoring Convergence 

At each iteration of the algorithm, the current /3 is computed assuming that the current 
active set A and the current 7_4-value are optimal. When the current active set is not 
optimal, the current /3 (where P^ completed by zeros on the complement A^^ is never- 
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theless optimal for a 7-value defined in (where 7^4 is completed by ad hoc values on the 
complement A^). However this 7 fails to belong to P-y (otherwise, the problem would be 
solved: A, 7 and (3 would indeed be optimal). The following proposition relates the current 
objective function, associated with an infeasible 7-value (7 ^ 'D-y), to the global optimum 
of the optimization problem. 

Proposition 2 For any i]^ > 0, and for all vectorial norm \\-\\^, when Dj is defined as 
V-y = {'y eW : ||7||^ < rj^}, then, V7 G : ||7||,, > we have: 



where 



Ja(/3,7) = ||X/3-y||^ + A||/3-7||^ and /3*(7) = arg min Ja(/3,7) • 



5*66 proof in Appendix A.S\ 



This proposition can be used to compute an optimality gap at Step 3 of Algorithm [T} 
by picking a 7-value such that the current worst-case gradient g is null (the current /3- value 
then being the optimal (3*{-f)). Other options could be considered, but they would result 
in significant extra computation. The optimality gap computed from Proposition [2] differs 
from the Fenchel duality gap (see Bach et al. , 2012). For the elastic net expressed in Q, 



Fenchel inequality (see details in Mairal, 2010) yields the following optimality gap: 

2 

min max M(3n') >Jx (/?* (7) ,7) - ^ (l|X/3* (7) - y||^ + A ||/3* (7)!!^ 

l3£M.Pf'£T>.y II7II* 

-2^(X/3'^(7)-y)V . 
II7IU 

The two optimality gaps are empiricaly compared in Figure [3] for the elastic net, along a 
short regularization path with five values of the £i-penalization parameter. Fenchel's duality 
gap is more accurate here, especially for the rougher solutions. However, the practical use 
of these upper bounds is rather limited as they are both fairly coarse unless a very accurate 
solution is reached. These optimality gaps may thus be used to assess convergence, but 
they are ineffective as stopping rules when fast and rough solutions are sought. 



5. Numerical Experiments 

This section compares the performances of our algorithm to its state-of-the-art competitors 
from an optimization viewpoint. Efficiency may then be assessed by accuracy an speed: 
accuracy is the difference between the optimum of the objective function and its value at 
the solution returned by the algorithm; speed is the computing time required for returning 
this solution. Obviously, the timing of two algorithms/packages has to be compared at 
similar precision requirements, which are rather crude in machine learning, far from machine 
precision ( Bottou and Bousquet 2008 ) . 
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n = 50, p = 200 




# of iterations 



Figure 3: Monitoring convergence: true optimality gap (solid black) versus our pessimistic 
bound (dashed blue) and Fenchel's duality gap (dotted red) computed at each 
iteration of Algorithm [T| 



We compare the performance of the proposed quadratic solver with representatives of 
the most successful competing algorithms. First, we use our own implementations of all 
competitors, so as to provide comparisons without implementation biases (language, library, 
etc.). We use R with most of the matrix calculus done in C++ using the RcppArmadillo package 
( [Eddelbuettel and Frangois 2011 Sanderson, 2010) that relies on BLAS/LAPACK libraries for 
linear algebra operations. Second, we compare our code to the leading standalone packages 
that are available today, so as to provide comparisons avoiding a possible competence bias. 



We use simulated data to obtain representative average results. Their generation covers 
the typical attributes of the real data encountered in post genomic and signal processing. 
In these domains, the main optimization difficulties result from ill-conditioning, which is 
either due to the high correlation between predictors, or to underdetermination when the 
number of variables exceeds the sample size (also known as the high-dimensional or the 
"large p small n" setup). For the optimization algorithms based on active set strategies, 
bad conditioning is somehow alleviated when the objective function has a regular behavior 
when restricted to the subspace containing the solution. All other things being equal, this 
local conditioning is thus governed by the sparsity of the unknown true parameter (which 
affects the sparsity of the solution), which also heavily impacts the running times of most 
optimization algorithms available today. 
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5.1 Data Generation 

The above-mentioned characteristics are explored without difficulty in the framework of 
linear regression. We generate samples of size n from the model 

y = Xp* + e, e~AA(0,cj2l) , 

with a chosen so as to reach a rather strong coefficient of determination [R^ ~ 0.8). The 
design matrix X is drawn from a multivariate normal distribution in M^, and the condi- 
tioning of X^X is ruled by the correlation between variables. We use the same correlation 
coefficient p for all pairs of variables. The sparsity of the true regression coefficients is 
controlled by a parameter s, with 

/3*= (2,...,2, -2,. . .,-2 ,0,...,q) . 

s/2 s/2 P-S 

Finally, the ratio p/n quantifies the well/ill-posedness of the problem. 



5.2 Comparing Optimization Strategies 



We compare here the performance of three state-of-the-art optimization strategies imple- 
mented in our own computational framework: accelerated proximal method (see, e.g., Beck 



and Teboulle, 2009), coordinate descent (popularized by Friedman et al. , 2007), and our 



Tenet 



|X/3 



y|l2 + Ai||/3||i + ^ 



mil 



(8) 



algorithm, that will respectively be named hereafter proximal, coordinate and quadratic. 
Our implementations estimate the solution to elastic-net problem 

1 
2 

which is strictly convex when A2 > and thus admits a unique solution even if n < p. 

The three implementations are embedded in the same active set routine, which approx- 
imately solves the optimization problem with respect to a limited number of variables as in 
Algorithm [T| They only differ regarding the inner optimization problem with respect to the 
current active variables, which is performed by an accelerated proximal gradient method 
for proximal, by coordinate descent for coordinate, and by the resolution of the worst- 



case quadratic problem for quadratic. We followed the practical recommendations of Bach 



et al. (2012) for accelerating the proximal and coordinate descent implementations, and we 



used the same halting condition for the three implementations, based on the approximate 
satisfaction of the ffist-order optimality conditions: 



max 
j{ei...p} 



xny-X/3 +A2/3 



< Al + T, 



(9) 



where the threshold r was fixed to r = 10~^ in our simulations. Finally, the active set 
algorithm is itself wrapped in a warm-start routine, where the approximate solution to 
•^Ai A2 used as the starting point for the resolution of Jy^^x2 < Ai. 

Our benchmark considers small-scale problems, with size p = 100, and the nine situa- 
tions stemming from the choice of three following parameters: 



3. The rather loose threshold is favorable to coordinate and proximal, which reach the threshold, while 
quadratic ends up with a much smaller value, due to the exact resolution, up to machine precision, of 
the inner quadratic problem. 
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• low, medium and high levels of correlation between predictors (p G {0.1,0.4,0.8}), 

• low, medium and high-dimensional setting [p/n G {2, 1,0.5}), 

• low, medium and high levels of sparsity [s/p G {10%, 30%, 60%}). 

Each solver computes the elastic net for the tuning parameters Ai and A2 on a 2D-grid of 
50 X 50 values, and their running times have been averaged over 100 runs. 

All results are qualitatively similar regarding the dimension and sparsity settings. Fig- 
ure [4] displays the high-dimensional case {p = 2n) with a medium level of sparsity {s = 30) 
for the three levels of correlation. Each map represents the log-ratio between the timing 
of either coordinate or proximal versus quadratic, according to (Ai,A2) for a given corre- 
lation level. Dark regions with a value of 1 indicate identical running times while lighter 
regions with a value of 10 indicate that quadratic is 10 times faster. Figure [4] illustrates 
that quadratic outperforms both coordinate and proximal, by running much faster in most 
cases, even reaching 300-fold speed increases. The largest gains are observed for small 
(Ai, A2) penalty parameters for which the problem is ill-conditioned, including many active 
variables, resulting in a huge slowdown of the first-order methods coordinate and proximal. 
As the penalty parameters increase, smaller gains are observed, especially when A2, attached 
to the quadratic penalty, reaches high values for which all problems are well-conditioned, 
and where the elastic-net is leaning towards univariate soft thresholding, in which case all 
algorithms behave similarly. 



5.3 Comparing Stand-Alone Implementations 

We now proceed to the evaluation of our code with three other stand-alone programs pub- 
licly available as R packages. We chose three leading state-of-the-art packages, namely 



glmnet (Generalized Linear Models regularized by Lasso and elastic-NET, Friedman et al. 



2010), lars (Least Angle Regression, lasso and forward Stagewise, Efron et al. , 2004) and 



SPAMS (SPArse Modehng Software, Bach et al. , 2012), with two options SPAMS-FISTA, which 



implements an accelerated proximal method, and SPAMS-LARS which is a lars substitute. 
Note that glmnet does most of its internal computations in Fortran, lars in R, and SPAMS 
in C++. Our own implementation, by resolution of worst-case quadratic problem, is shipped 
within an R package quadrupen publicly available at the second author's web pag^ 

We benchmark these packages by computing regularization paths for the Lasscj^ that 
is, the elastic-net Problem ([s]) with A2 = 0. The inaccuracy of the solutions produced is 
measured by the gap in the objective function compared to a reference solution, considered 
as being the true optimum. We use the lars solution as a reference, since it solves the Lasso 
problem up to the machine precision, relying on the LAPACK/BLAS routines. Furthermore, 
lars provides the solution path for the Lasso, that is, the set of solutions computed for each 
penalty parameter value for which variable activation or deletion occurs, from the empty 
model to the least-mean squares model. This set of reference penalty parameters is used 
here to define a sensible reproducible choice. 



4. or directly at http : // stat . genopole . cnrs . f r/logiciels/quadrupen. This will be made available 
on the CRAN (Comprehensive R Archive Network) 

5. We benchmark the packages on a Lasso problem since the parametrization of the elastic-net problem 
diflters among packages, hindering fair comparisons. 
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logio A2 logio ^2 



Figure 4: Log-ratio of computation times for coordinate (left), and proximal (right), versus 
quadratic, for (p, n) = (100,50), s = 30 and correlation p E {0.1,0.4,0.8} (top, 
middle and bottom respectively). 
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In high dimensional setups, the computational cost of returning the solutions for the 
largest models may be overwhelming compared to the one necessary for exploring the in- 



teresting part of the regularization path (Simon et al. , 2011: Friedman et al. , 2010). This 



is mostly due to numerical instability problems that may be encountered in these extreme 
settings, where the Lasso solution is overfitting as it approaches the set of solutions to the 
underdetermined least squares problem. We avoid a comparison mostly relying on these 
spurious cases by restricting the set of reference penalty parameters to the first min(n,p) 



steps of lars (similar settings are used by Friedman et al. , 2010 1 . 



Henceforth, the distance D of a given method to the optimum is evaluated on the whole 
set of penalties A used along the path, by 

where J^^^'^{/3) = J^^Q^if^) is the objective function of the Lasso evaluated at /3, and 
is the estimated optimal solution provided by the method package currently tested. 

The data sets are generated according to the linear model described above, in three 
different high-dimensional settings and small to medium number of variables: {p, n) = 
(100,40), {p,n) = (1000,200) and {p,n) = (10 000,400). The sparsity of the true underlying 
(3* is governed by s = 0.25min(n,p), and the correlation between predictors is set by 
p G {0.1, 0.4, 0.8}. For each value of p, we averaged the timings over 50 simulations, ensuring 
that each package computes the solutions at identical A values, as defined above. 

We pool together the runtimes obtained for the three levels of correlation for quadrupen, 
SPAMS-LARS and lars, which are not sensible to the correlation between features. In each 
plot of Figure [5j each of these methods is thus represented by a single point marking the 
average precision and the average distance to the optimum over the 150 runs (50 runs for 
each p G {0.1,0.4,0.8}). Note that for lars only the abscissa is meaningful since D(lars) 
is zero by definition. Besides, quadrupen, which solves each quadratic problem up to the 
machine precision, tends to be within this precision of the lars solution. The SPAMS-LARS is 
also very precise, up to 10~®, which is the typical precision of the approximate resolution 
of linear systems. It is the fastest alternative for solving the Lasso when the problem is 
high- dimensional with a large number of variables (Figure [5| bottom-left). 

In contrast, the (precision, timing)-values of glmnet and SPAMS-FISTA are highly affected 
by the threshold parameter^ that control their stopping conditions. The computational 
burden to reach a given precision is also affected by the level of correlation, as illustrated in 
Figure [5j Obviously, a precise solution is difficult to reach with first-order descent algorithms 
in a high correlation setup, which corresponds to an ill-conditioned linear system. It may 
be surprising to observe that SPAMS-FISTA is about ten time slower than glmnet, as proximal 
and cooordinate descent methods were experimentally shown to be roughly equivalent in our 



preceding analysis and by Bach et al. (2012). However, these two comparisons were carried 



out with the same active set strategy (that is, with active set for ours and without active 



In glmnet, convergence is monitored by the stabihty of the objective function, measured between two 
optimization steps, and optimization is hahed when changes faU below the specified threshold (scaled 
by the null deviance). In SPAMS-FISTA, the stopping condition of the algorithm is based on the relative 
change of parameters between two iterations. 
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Figure 5: Distance to optimum versus CPU time for three different high-dimensional 
settings: (p, n) = (100,40) (top left), {p,n) = (1000,200) (top right) and 
p = (10 000,400) (bottom left). 



15 



Grandvalet, Chiquet and Ambroise 



set for Bach et al. , 2012). We believe that this difference in the handhng of active variables 



explains the relative bad performance of SPAMS-FISTA, which optimizes all variables along 
the regularization path, while glmnet uses a greedy active set strategy. 

Overall, our implementation is highly competitive, that is, very accurate, at the lars 
level, and much faster. The speed improvements of glmnet are only observed for very rough 
approximate solutions and SPAMS-FISTA is dominated by glmnet. Our experiments, in the 
framework of active set methods, agree with the results of Bach et al. (2012): indeed. 



they observed that first-order methods are competitive with second-order ones only for low 
correlation levels and small penalties (which entails a large number of active variables). Con- 
versely, our results may appear to contradict some of the experimental findings of |Friedman 
et al. (2010): first, we observe that glmnet is quite sensitive to correlations, and second, the 



optimized second-order methods are competitive with glmnet. These differences in conclu- 
sions arise from the differences in experimental protocols: while we compare running times 
at a given accuracy, they are compared at a given threshold on the stopping criterion by 
Friedman et al. (2010). Regarding the influence of correlations, the stability-based criterion 



can be fooled due to the tiny step size that typically occurs for ill-conditioned problems, 
leading to a sizable early stopping. Regarding the second point, even though the R imple- 
mentation of lars may indeed be slow compared to glmnet, considerable improvements can 
be obtained using optimized second-order methods such as quadrupen as soon as a sensible 
accuracy is required, especially when correlation increases. 

Finally, among the accurate solvers, SPAMS-LARS is insignificantly less accurate than 
quadrupen or lars in a statistical context. It is always faster than lars and slightly faster 
than quadrupen for the largest problem sizes (Figure [sj bottom-left) and much slower for 
the smallest problem (Figure [5| top- left). 



5.4 Link between accuracy of solutions and prediction performances 



When the "irrepresentable condition" (Zhao and Yu 2006) holds, the Lasso should select 



the true model consistently. However, even when this rather restrictive condition is fulfilled, 
perfect support recovery obviously requires numerical accuracy: rough estimates may speed 
up the procedure, but whatever optimization strategy is used, stopping an algorithm is likely 
to prevent either the removal of all irrelevant coefficients or the insertion of all relevant ones. 
The support of the solution may then be far from the optimal one. 

We advocate here that our quadratic solver is very competitive in computation time 
when support recovery matters, that is, when high level of accuracy is needed, in small (few 
hundreds of variables) and medium sized problems (few thousands). As an illustration, we 
generate 100 data sets under the same linear model as above, with a rather strong coefficient 
of determination [R^ ~ 0.8 on average), a rather high level of correlation between predictors 
{p = 0.8) and a medium level of sparsity {s/p = 30%). The number of variable is kept low 
{p = 100) and the difficulty of the problem is tuned by the n/p ratio. For each data set, we 
also generate a test set sufficiently large (say, lOn) to evaluate the quality of the prediction 
without depending on any sampling fluctuation. We compare the Lasso solutions computed 
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by quadrupen to the ones returned by glmnet with various level of accuracjQ Figure [6| 
reports performances, as measured by the mean squared test error and the support error 
rate. 





Figure 6: Test performances according to the penalty parameter for the Lasso estimates 
returned by quadrupen and glmnet at various level of accuracy. Three high- 
dimensional setups are illustrated: from left to right n/p = 1/2, n/p = 1 and 
n/p = 2; top: mean squared test error; bottom: support error rate. 



7. This is done via the thresh argument of the glmnet procedure, whose default value is le-7. In our 
experiments, low, med and high level of accuracy for glmnet respectively correspond to thresh set to 
le-1, le-4, and le-9. 
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As expected, the curves show that selecting variables and searching for the best predic- 
tion are two different problems. The selection problem (bottom of Figure [g]) always requires 
a sparser model than the prediction problem. But despite this obvious difference, the more 
accurate the solution returned by the algorithm, the better the performances for any levels 
of penalty and for both performance measures. 



methods 


quadrupen 


glmnet (low) 


glmnet (med) 


glmnet (high) 


timing (ms) 


8 


7 


8 


64 


accuracy (dist. to opt.) 


5.9 X IQ-^^ 


7.2 X 10° 


6.04 X 10° 


1.47 X 10-2 



Table 1: Median timings and solution accuracies 

Now focusing on glmnet performances, the better the accuracy, the smaller the MSE 
and the support error rate. But the better the accuracy, the slower the algorithm becomes. 
Using the default settings allows to have a result very close to our quadratic solver, and 
the performance differences become negligible between our approach and glmnet running 
with high precision. However, Table [T] illustrates that high accuracy is achieved at a high 
computational cost: to be at par with quadrupen with regards to test performances, glmnet 
is about ten times slower than our solver. 



6. Discussion 

This paper presents a new viewpoint on sparsity-inducing penalties stemming from robust 
optimization. This viewpoint enables to cast in the same framework several well-known 
penalties, to derive a general-purpose algorithm that computes the solution to their com- 
panion penalized regression problem, and to obtain a lower bound on the minimum of 
the objective function that provides an assessment of convergence. The proposed algo- 
rithm solves a series of small-size quadratic problems defined from the currently estimated 
worst-case perturbation. It has been thoroughly tested and compared with state-of-the-art 
implementations for the elastic net and the Lasso, comparing favorably in most cases to all 
its competitors. 

We detailed here how the Lasso and the group-Lasso (with the ^oo.i mixed norm) penal- 
ties, possibly applied together with an £2 ridge penalty (leading to what is known as the 
elastic net for the Lasso) can be derived from this new robust optimization viewpoint. 
The algorithm can be adapted to non-quadratic loss functions for addressing other learning 
problems such as classification, but this generalization, which requires solving non-quadratic 
problems, may not be as efficient compared to the existing alternatives. We are now ex- 
amining how to address a wider range of penalties by extending the framework in two 
directions: first, to accommodate additional general £2 penalties in the form of arbitrary 
symmetric positive semidefinite matrix instead of the simple ridge, in particular to provide 



an efficient implementation of the structured elastic-net (Slawski et al., 2010) ; second, we 



plan to derive similar views on a wider range of sparsity-inducing penalties, such as the 



fused-Lasso or the OSCAR ( |Bondell and Reichj |2008| ). 
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Appendix A. Proofs 

A.l Proof of Proposition [l] 



first show that the objective function of Problem Q is, up to an irrelevant constant, 
upper bounded by the objective function of Problem 1^: 



(X- Ax)/3 + X7 + e-y||2 < ||X(/3 + 7) - y||2 + ||Ax/3||2 + Hells 

< ||X(/3 + 7)-y|l2 + ^x||/3|l2 + ^. 



(10) 
(11) 



where (10) stems from the triangular inequality and (11) follows from the definition of the 



uncertainty sets. The right-hand side of (11) is, up to an irrelevant constant, identical to 



the objective function of Problem ([s]) after the change of variable P <— j3 + 

We conclude, by showing that, for all (/3,7), there is (A^, e*) G T>y^ x Dg such that the 
upper bound on the right-hand side of ([TT| is reached: 



X(/3 + 7) - y 



X(/3 + 7)-y 



|x(/3 + 7)-y|l2 II/3II2 ''||x(/3 + 7)-y|l2 ■ 

With these choices, the vectors X(/3-|-7) — y, Ax/3 and e* are colinear, so that minimizing 



the left-hand-side of (10) is equivalent to minimize the right-hand-side of (11). 



A. 2 Proof of Proposition [2] 



We detail here a proof yielding a slightly tighter bound. Proposition [2] is simply a 
corollary of Proposition |4] stated and proved below. 

When T>-y is defined by a norm as in Proposition [2| the following Lemma relates the 
penalty associated with a infeasible 7- value (||7||^ > to the one obtained by shrinking 
this 7- value to reach the boundary of P-y. 

Lemma 3 Let S C {1, a G (0, 1), ||-| |^ be a vectorial norm and ip*{-,S, a) : — t- 

be defined as follows: 

99^(7, 5, a) = 75 



Then, 



Proof 



||/3-(^*(7,5,a)||^>a||/3-7ll2-«(l-«)ll75=ll2 • (12) 



11/3 -(^*(7,5,a)||^ = ||/35-75ll2 + « 11/35^- 75=112 + (l-«) 11/^5=112 -«(l-«)ll75^ll2 

> 11/^5 -75II2 +« 11/^5- -75-II2 -"(1 -") II75-II2 

> a 11/3 -7||2-a(l- a) 1175^=112 • 
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Proposition 4 For any r]^ > 0, and for all vectorial norm \\-\\^, when D-y is defined as 
V^ = {-feW: ||7||^ < r]^}, then, V7 G : ||7||^ > rj^, andy{S,a) G 2^1'-'^ x (0,1) such 
that ||(75;«75=)ll* <V-y, we have: 

ISii^ max -^a(/3,7') > a-^A (/3* (7) ,7) - Aa(l - a) ||75<=ll2 > 



where 

Ja(/3,7) = ||X/3-y||2 + A||/3-7||2 and (7) = arg mmJA(/3, 7) • 

Proof For all (3 (^W and for any (7, 5, a) G x 2^1'-'P> x (0, 1) such that 93* (7, 5, a) G 
P-j,, wii/i defined as in Lemma^ we have: 

max Ja(/3,7') > JA(/3,^*(7,5,a)) , (13) 

since f*{'y,S,a) belongs to T)~^. We now compute a lower hound of the right-hand- side for 
7 such that ||7||^ > r]^: 

Jx if3, ip*{j,S, a)) = a f- llX/3 - y||^ + - 11/3 - ^*{-f,S, a)\\i) 

>a(\\X(3-y\\l+^\\(3-^*i-f,S,a)\\l 
\ a 



>a(^||X/3-y||^ + A||/3-7||^j -Aa(l-a)||75c||^ , 

where the last inequality stems from Lemma^ This inequality holds for any given f3-value, 
in particular for P*{if*{'y,S,a)) = arg min^gjjp Jx {(3 , if* {j , S , a)) : 

min Jx {(3,ip*{y,S,a)) > aJx{P* {ip* {■y , S , a))) - Aa(l - a) \hsA\l 

>aJA(r (7),7)-Aa(l-a)||75=||2 , (14) 



where the second inequality follows from the definition of {'-/). Inequality (14) can be 
restated as: 

min Ja (/3, 99* (7, 5, a)) > a min Ja(/3, 7) - Aa(l - a) \\-fsc\\l . 



We finally remark that, since <p*{j,S,a) G "D-y, we trivially have: 
min max Ja(/3>7) > niin Ja (/3, (/5*(7, 5, a)) 



which concludes the proof. 
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